Introduction - Forecasting¶

In the dynamic landscape of business, the ability to predict future trends and outcomes is invaluable. Forecasting stands as a cornerstone in strategic planning, offering a glimpse into potential future scenarios based on current data and historical trends. Its significance spans various sectors, from finance and marketing to supply chain management and beyond. Among the many applications of forecasting, the domain of box office predictions provides a compelling illustration of its utility and impact.

Box Office Forecasting¶

In this project, we extract box office information from BoxOfficeMojo.com (https://www.boxofficemojo.com), aiming to provide weekly box office predictions using time series analysis. Our approach starts with thorough exploratory data analysis, coupled with comprehensive data cleaning and preparation. Significant external events, such as the COVID-19 pandemic, are critical in influencing box office trends, and we factor in these and other external elements. The project tackles complex, real-world data, presenting a fascinating and challenging dataset. We employ a SARIMAX model, effectively capturing the seasonal patterns, trends, and cyclical nature inherent in the data. Looking ahead, further exploration into the influence of social media trends on box office performance could enhance the accuracy of our forecasts.

In [1]:
# import libraries
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as mcolors
from statsmodels.tsa.statespace.sarimax import SARIMAX
import seaborn as sns

Exploratory Data Analysis (EDA)¶

Data Download and Cleaning¶

The data is downloaded from the accompanying database. Initial examination reveals the necessity of data cleaning. In particular, we use the following techniques:

  1. Conversion of the dates to python datetime (and additional cleaning of dates).
  2. Data from before 1997 is ignored due large amounts of missing data.
  3. Interpolation of missing dates during March 2020 (COVID-19).
  4. Notable decline in box office during the COVID-19 pandemic might be taken into consideration later.
In [2]:
# Data download from database
db_path = 'box_office_data.db'
try:
    conn = sqlite3.connect(db_path)

    query = "SELECT * FROM box_office_data"
    df = pd.read_sql_query(query, conn)
except Exception as e:
            print(f"An error occurred: {e}")
finally:
    conn.close()
In [3]:
df.head(15)
Out[3]:
Date Weekday DayOfYear DailyGross DailyChange WeeklyChange NoReleases No1Movie GrossOfNo1Movie SpecialOccasion
0 Jan 9 2024 Tuesday 9 6685011 +23% -66.6% 22 Anyone But You 1268320
1 Jan 8 2024 Monday 8 5435026 -71.3% -81.4% 24 Anyone But You 1003785
2 Jan 7 2024 Sunday 7 18939826 -40.4% -17.9% 34 Wonka 3638210
3 Jan 6 2024 Saturday 6 31767466 +28.6% -20.7% 34 Wonka 6130408
4 Jan 5 2024 Friday 5 24699285 +97.1% -33.9% 34 Night Swim 5246275
5 Jan 4 2024 Thursday 4 12533498 -12.1% -62.3% 39 Wonka 2703155
6 Jan 3 2024 Wednesday 3 14266846 -28.7% -57.9% 39 Wonka 3152613
7 Jan 2 2024 Tuesday 2 20006635 -31.5% -52.1% 39 Wonka 4601670
8 Jan 1 2024 Monday 1 29208719 -19.3% -48.5% 41 Wonka 6636962 New Year
9 Dec 31 2023 Sunday 365 23078184 -17.5% -52.6% 39 Wonka 5208897 New Year
10 Dec 30 2023 Saturday 364 40050370 +7.2% +38% 39 Wonka 8637841
11 Dec 29 2023 Friday 363 37348409 +12.3% -6.4% 39 Wonka 8630268
12 Dec 28 2023 Thursday 362 33261609 -1.9% +168.2% 40 Wonka 7988504
13 Dec 27 2023 Wednesday 361 33892628 -18.9% +251.5% 40 Wonka 8135639
14 Dec 26 2023 Tuesday 360 41788862 -28.6% +259.9% 40 Wonka 8970413
In [4]:
# 911 date was in the wrong format 
date_wrong = df[df.Date == 'Sep 119 2001'].index.values[0]
df.loc[date_wrong,'Date'] = 'Sep 11 2001'
In [5]:
# Transform the dates into datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%b %d %Y')
In [6]:
# Sort by date
df.sort_values(by='Date',ascending=False,inplace=True)
In [7]:
# Drop wrong value from dataset
df.drop([13064],inplace=True)
In [8]:
# Reset index
df.reset_index(inplace=True)
In [9]:
df.drop(columns=['index'],inplace=True)
In [10]:
# Check for missing values (does not check for missing rows)
df.isna().sum()
Out[10]:
Date               0
Weekday            0
DayOfYear          0
DailyGross         0
DailyChange        0
WeeklyChange       0
NoReleases         0
No1Movie           0
GrossOfNo1Movie    0
SpecialOccasion    0
dtype: int64
In [11]:
# Drop daily and weekly change
df.drop(columns=['DailyChange','WeeklyChange'],inplace=True)

Older data missing¶

Since much of the data from before 1997 is missing and the data might be very different, we will ignore all data from before 1997. Additionally, some data during March 2020 is missing. We will interpolate this data. However, incorporating data from 2020 and 2021 decreases the quality of the forecast due to the strong external factors influencing the behavior.

In [12]:
# Create new column for the year
df['Year'] = df['Date'].dt.year

# Count the number of entries per year
data_per_year = df.groupby('Year').size()
In [13]:
# Plotting
plt.figure(figsize=(12, 6))
# Plot all data in blue
plt.bar(data_per_year.index, data_per_year.values, color='skyblue', label='Complete data')
# Overlay in red where entries do not cover all days
plt.bar(data_per_year[data_per_year < 365].index, data_per_year[data_per_year < 365].values, color='palevioletred', label='< 365 Entries')
# Overlay in black for years with less than 300 entries
plt.bar(data_per_year[data_per_year < 300].index, data_per_year[data_per_year < 300].values, color='goldenrod', label='< 300 Entries')
plt.title('Number of Data Entries per Year')
plt.xlabel('Year')
plt.ylabel('Number of Entries')
plt.legend()
plt.show()
In [14]:
df_2020 = df[df['Year']==2020].copy()
In [15]:
df_2020['Date_Shift1'] = df_2020['DayOfYear'].shift(1)
In [16]:
df_2020 = df_2020.dropna()
In [17]:
df_2020['TimeDelta'] = df_2020['Date_Shift1'] - df_2020['DayOfYear']
In [18]:
df_2020.head()
Out[18]:
Date Weekday DayOfYear DailyGross NoReleases No1Movie GrossOfNo1Movie SpecialOccasion Year Date_Shift1 TimeDelta
1117 2020-12-30 Wednesday 365 2876460 16 Wonder Woman 1984 1451131 COVID 2020 366.0 1.0
1118 2020-12-29 Tuesday 364 3063773 17 Wonder Woman 1984 1655240 COVID 2020 365.0 1.0
1119 2020-12-28 Monday 363 3085353 17 Wonder Woman 1984 1822877 COVID 2020 364.0 1.0
1120 2020-12-27 Sunday 362 5302360 27 Wonder Woman 1984 3408342 COVID 2020 363.0 1.0
1121 2020-12-26 Saturday 361 8403557 27 Wonder Woman 1984 5811188 COVID 2020 362.0 1.0
In [19]:
# Plotting
plt.figure(figsize=(12, 6))
# Plot all data in blue
plt.bar(df_2020.DayOfYear, df_2020.TimeDelta, color='red', label='Attention')
plt.bar(df_2020[df_2020['TimeDelta']<2].DayOfYear, df_2020[df_2020['TimeDelta']<2].TimeDelta, color='blue', label='Good values')
plt.title('Time Until next data entry')
plt.xlabel('Day of Year')
plt.ylabel('Number of Days')
plt.legend()
plt.show()
In [20]:
df_2020[df_2020['TimeDelta'] >= 2]
Out[20]:
Date Weekday DayOfYear DailyGross NoReleases No1Movie GrossOfNo1Movie SpecialOccasion Year Date_Shift1 TimeDelta
1389 2020-03-27 Friday 87 192 1 Be Natural: The Untold Story of Alice Guy-Blaché 192 COVID 2020 94.0 7.0
In [21]:
df_covid = df[df['SpecialOccasion']=='COVID']
In [22]:
# Set 'Date' as the index
df.set_index('Date', inplace=True)

# Filter data from 1997 onwards
df_post_1997 = df['1997':]
In [23]:
df_missing = pd.DataFrame({'Date': pd.date_range('2020-03-28', '2020-04-02', freq='D'),
                          'Weekday': ['Saturday','Sunday','Monday','Tuesday','Wednesday','Thursday'],
                          'DayOfYear': [k for k in range(88,94)],
                          'DailyGross': ['nan' for k in range(6)],
                          'NoReleases': ['nan' for k in range(6)],
                          'No1Movie': ['nan' for k in range(6)],
                          'SpecialOccasion': ['COVID' for k in range(6)],
                          'Year': ['2020' for k in range(6)]
                          })
In [24]:
df_missing.set_index('Date', inplace=True)
In [25]:
# Combine the original DataFrame with the missing dates DataFrame
df_combined = pd.concat([df, df_missing]).sort_index()
In [26]:
# Convert integer values to float for interpolation
df_combined['DailyGross'] = df_combined['DailyGross'].astype(float)
df_combined['NoReleases'] = df_combined['NoReleases'].astype(float)
In [27]:
# Linear interpolation of the missing data
df_2020 = df_combined.loc['2020'].interpolate()
In [28]:
df_combined.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 13456 entries, 1977-05-25 to 2024-01-21
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Weekday          13456 non-null  object 
 1   DayOfYear        13456 non-null  int64  
 2   DailyGross       13450 non-null  float64
 3   NoReleases       13450 non-null  float64
 4   No1Movie         13456 non-null  object 
 5   GrossOfNo1Movie  13450 non-null  float64
 6   SpecialOccasion  13456 non-null  object 
 7   Year             13456 non-null  object 
dtypes: float64(3), int64(1), object(4)
memory usage: 946.1+ KB
In [29]:
# Reintegrate interpolated data back into the main DataFrame
df_combined.update(df_2020)

Data Visualization¶

After preprocessing the data, we visualize the time series to get an overview over trends, seasonality and cycles. The data shows some very interesting patterns. There are clear weekly patterns (peak on weekends), some changes that look seasonal and when a new movie is antcipated, there is a sudden jump. This year Barbenheimer, Covid clearly visible, overall trend increasing, in Dec 2021 Spider Man no way home, in may 2019 Avengers Endgame etc, Star Wars Dec 2017 and Dec 2015. This year is missing jump due to strike in Hollywood. Most likely, we will ignore 2020 and 2021 for our analysis as they are very unlike the rest of the data and might make predictions worse. We will later compare predictions with and without this data. Also might drop data from before 2002 as it looks different. (Note how 911 caused people to skip crowded places for a bit).

Another thought: Jumps might be correlated with how long the no1 movie has been reigning. We might want to take this into consideration as a feature for the prediction.

After preprocessing our dataset, we embark on visualizing the time series data to gain insights into its underlying trends, seasonality, and cycles. Our analysis uncovers fascinating patterns within the dataset. Notably, we observe distinct weekly patterns, with peaks occurring on weekends. Additionally, there are discernible changes that exhibit seasonal characteristics. Further, whenever the release of a highly anticipated movie is imminent, we observe sudden spikes in the data. Notably, in December 2021, the release of "Spider-Man: No Way Home" caused a significant surge in box office figures, reminiscent of the effect observed in May 2019 with "Avengers: Endgame" as well as in May 2018 with "Avengers: Infinity War." Other standout moments include the December releases of the new "Star Wars" movies in 2017 and 2015. This year, "Barbie" and "Oppenheimer" produced a characteristic peak in July. However, it's worth noting that this year lacks the characteristic jump in box office revenue due to the 2023 Writers Guild of America strike.

Additionally, several noteworthy events stand out. The impact of the COVID-19 pandemic is unmistakable, as it led to substantial deviations in the data. Consequently, we are inclined to exclude the years 2020 and 2021 from our primary analysis, as they diverge significantly from the typical data patterns and may adversely impact predictive models. Later in our analysis, we will make comparisons between predictions with and without this data. Additionally, we discard data points preceding 2002, as they do not exhibit the same patterns as modern data.

For more accurate predictions, we consider that these significant jumps in box office revenue might be correlated with the duration of a movie's reign as the number one film. Consequently, we incorporate this as an additional feature in our prediction models, recognizing its potential impact on forecasting accuracy.

In [30]:
fig, axs = plt.subplots(27,1,figsize=(18,120))
for year in range(1997,2024):
    axs[year-1997].plot(df_combined.loc[str(year)].index, df_combined.loc[str(year)]['DailyGross'], label='Box Office Data',color='darkmagenta',linewidth=2,marker='o')
    axs[year-1997].title.set_text('Daily Box Office Total' + str(year))
    axs[year-1997].set_ylim([0, 120000000])

plt.show()
In [31]:
# Applying one-hot-encoding
df_no1_movie_one_hot = pd.get_dummies(df_combined, columns=['No1Movie'])

# Display the resulting DataFrame
df_no1_movie_one_hot.head()
Out[31]:
Weekday DayOfYear DailyGross NoReleases GrossOfNo1Movie SpecialOccasion Year No1Movie_10,000 BC No1Movie_12 Strong No1Movie_13 Going on 30 ... No1Movie_X2 No1Movie_Yes Man No1Movie_You Got Served No1Movie_You've Got Mail No1Movie_Zack and Miri Make a Porno No1Movie_Zero Dark Thirty No1Movie_Zombieland No1Movie_Zootopia No1Movie_nan No1Movie_xXx
Date
1977-05-25 Wednesday 145.0 254809.0 1.0 254809.0 1977 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1977-05-26 Thursday 146.0 238965.0 1.0 238965.0 1977 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1977-12-26 Monday 360.0 3026559.0 1.0 3026559.0 1977 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1980-05-21 Wednesday 142.0 1333305.0 1.0 1333305.0 1980 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1980-05-22 Thursday 143.0 995026.0 1.0 995026.0 1980 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 1219 columns

No 1 Movies¶

In the following plot, we're taking a closer look at how long movies have ruled the Box Office's top spot over the last three decades. What's intriguing is how the movie industry has evolved in nearly 30 years: There's been a surge in movie releases, especially those big-budget blockbusters that draw huge crowds. As a result, the time a movie spends as the number one Box Office champ has shrunk to less than a third of what it used to be (from 150 days for "Independence Day" to less than 50 days for "Avatar: The Way of Water").

This observation sheds light on the dynamic nature of the movie industry, with more releases and evolving audience tastes. It's a world where movies compete fiercely for that coveted top position in a landscape that's become increasingly competitive.

In [32]:
# Define time intervals
intervals = [('1996-2004', '1996-01-01', '2004-12-31'), ('2005-2014', '2005-01-01', '2014-12-31'), ('2015-2024', '2015-01-01', '2024-01-01')]

# Colors for the plots
colors = ['palevioletred', 'plum', 'mediumslateblue']

# Create subplots with 3 rows and 2 columns
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 30))

for i, (title, start_date, end_date) in enumerate(intervals):
    # Filter data for the interval
    interval_data = df_no1_movie_one_hot.drop(columns=['DayOfYear','NoReleases','GrossOfNo1Movie','SpecialOccasion','DailyGross','Weekday','Year']).loc[start_date:end_date]

    # Summing up the days each movie was number one
    days_as_no1 = interval_data.sum().sort_values(ascending=False)

    

    # Distribution plot for the interval
    axes[i, 0].bar(days_as_no1[days_as_no1.values > 1].index, days_as_no1[days_as_no1.values > 1].values, color=colors[i])
    axes[i, 0].set_title(f'Distribution of Number One Days for Each Movie ({title})')
    axes[i, 0].set_xlabel('Movies')
    axes[i, 0].set_ylabel('Number of Days as Number One')
    axes[i, 0].set_xticks([])

    # Top 15 movies plot for the interval
    top_15_movies = days_as_no1.head(15)
    labels = [movie.replace('No1Movie_','') for movie in top_15_movies.index]
    axes[i, 1].bar(top_15_movies.index, top_15_movies.values, color=colors[i])
    axes[i, 1].set_title(f'Top 15 Movies with the Longest Time as Number One ({title})')
    axes[i, 1].set_xlabel('Movies')
    axes[i, 1].set_ylabel('Number of Days as Number One')
    axes[i, 1].set_xticks(range(15))
    axes[i, 1].set_xticklabels(labels, rotation=90)

fig.tight_layout()
plt.show()
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/782048247.py:24: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axes[i, 0].set_xticks([])
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/782048247.py:24: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axes[i, 0].set_xticks([])
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/782048247.py:24: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axes[i, 0].set_xticks([])

Special Occasions¶

When we compare the box office on regular days to holidays and other special occasions (not considering COVID), it becomes clear that holidays drive the Box Office in every year.

In [33]:
# Data split into regular days and special occasions
df_regular = df_combined[df_combined['SpecialOccasion']=='']
df_special = df_combined[df_combined['SpecialOccasion']!='']
df_special_nocovid = df_special[df_special['SpecialOccasion']!= 'COVID']
In [34]:
# Calculate mean box office for each year for regular and special days
df_gross_regular = df_regular.groupby('Year')['DailyGross'].mean()
df_gross_special_nocovid = df_special_nocovid.groupby('Year')['DailyGross'].mean()

# Plot the data for visualization
plt.figure(figsize=(15, 7))
plt.plot(df_gross_regular.index,df_gross_regular.values,color = 'skyblue',linewidth=4)
plt.plot(df_gross_special_nocovid.index,df_gross_special_nocovid.values,color = 'palevioletred',linewidth=4)
plt.legend(['Regular Days','Special Occasion'],fontsize=20)
plt.title('Box Office Average',fontsize=25)
plt.xlabel('Movies',fontsize=15)
plt.ylabel('Number of Days as Number One',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

Weekly patterns¶

Finally, for every year, we plot the weekly development of the daily box office. Clearly, there is a weekly pattern. Fridays usually show a jump in Box Office as the weekend starts and new movies are released. On weeks without highly expected releases, Saturdays tend to have a slightly higher box office. Additionally, holidays can cause jumps in the daily box office during the week.

In [35]:
# Add data on the week of the year
df_combined['WeekOfYear'] = df_combined.index.isocalendar().week
In [36]:
# Plotting box office data by weekday for each week of the year
unique_weeks = df_combined['WeekOfYear'].unique()

# Set up the colors for the plots
colors = plt.cm.RdPu(np.linspace(0.25, 1, len(unique_weeks)))

plt.figure(figsize=(15, 10))

fig, axs = plt.subplots(13,2,figsize=(18,120))
for year in range(1998,2024):
    for week, color in zip(unique_weeks, colors):
        week_data = df_combined.loc[str(year)][df_combined['WeekOfYear'].loc[str(year)] == week]
        axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')

    axs[(year-1998)//2,year%2].set_title('Box Office Data by Weekday for Each Week of ' + str(year))
    axs[(year-1998)//2,year%2].set_xlabel('Weekday')
    axs[(year-1998)//2,year%2].set_ylabel('Box Office')
    axs[(year-1998)//2,year%2].set_xticks(ticks=np.arange(7))
    axs[(year-1998)//2,year%2].set_ylim([0,200000000])
    #plt.legend()
    
plt.show()
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/1429552293.py:13: MatplotlibDeprecationWarning: Support for passing numbers through unit converters is deprecated since 3.5 and support will be removed two minor releases later; use Axis.convert_units instead.
  axs[(year-1998)//2,year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
<Figure size 1080x720 with 0 Axes>

Data Preprocessing¶

Finally, we create two different forecast datasets. One including COVID data and one not including COVID data. The COVID pandemic has had a large negative impact on the box office. So, including it, will decrease the forecast that we generate. Due to the 2023 Writer's Guild of America strike, many releases of larger studios have been delayed. This has a negative impact on the daily box office as well. So, including the COVID data might do a better job in the current market.

In [38]:
# Create dataframes for making predictions
df_covid = df_combined.drop(columns=['No1Movie','GrossOfNo1Movie'])
df_covid = df_covid.reset_index()
In [39]:
# Drop data before 2002
df_covid = df_covid.drop(index=range(5401))
In [40]:
# Create dataframe without COVID data (2020-2021)
df_no_covid = df_covid.drop(index=range(11975,12706))
df_no_covid = df_no_covid.reset_index().drop(columns=['index'])

Model building¶

Linear Regression¶

To find a basis for our model building, we begin by building a linear model and comparing the performance of data containing COVID data and not containing COVID data. It become clear here, that including the COVID data significantly decreases the model performance as this data does not show the same patterns as non-COVID data. In particular, if we include the COVID data, the model underestimates the forecasting data.

In [41]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
import numpy as np

# Assuming df is your DataFrame
# Encoding categorical data (assuming 'Weekday' is the only categorical feature for simplicity)
df_encoded = pd.get_dummies(df_covid, columns=['Weekday'])

# Splitting the data
train_df = df_encoded.iloc[:-7]  # All data except the last week
test_df = df_encoded.iloc[-7:]    # This week

# Selecting features and target
X_train = train_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_train = train_df['DailyGross']
X_test = test_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_test = test_df['DailyGross']

# Model Training
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction
predictions = model.predict(X_test)

# Evaluation
mape = mean_absolute_percentage_error(y_test, predictions)
mse = mean_squared_error(y_test,predictions)
rmse = np.sqrt(mse)

print('The MAPE of the linear regression model (trained on full data) on the test data is: ',round(mape,2))
print('The RMSE of the linear regression model (trained on full data) on the test data is: ',round(rmse,2))
The MAPE of the linear regression model (trained on full data) on the test data is:  0.64
The RMSE of the linear regression model (trained on full data) on the test data is:  9918117.74
In [42]:
# Plotting actual vs predicted
plt.figure(figsize=(15, 7))
plt.plot(df_encoded.iloc[-150:].index, df_encoded.iloc[-150:].DailyGross, marker='o',markersize=10,label='Training Data',linewidth=3,color='cornflowerblue')
plt.plot(X_test.index, predictions, label='Predictions', color='fuchsia',linewidth=3,marker='^',markersize=10)
plt.title('Daily Box Office Gross: Actual vs Predictions (COVID data included)')
plt.xlabel('Date')
plt.ylabel('Daily Gross')
plt.legend()
plt.show()
In [45]:
# Assuming df is your DataFrame
# Encoding categorical data (assuming 'Weekday' is the only categorical feature for simplicity)
df_encoded = pd.get_dummies(df_no_covid.reset_index().drop(columns='index'), columns=['Weekday'])

# Splitting the data
train_df = df_encoded.iloc[:-7]  # All data except the last week
test_df = df_encoded.iloc[-7:]    # This week

# Selecting features and target
X_train = train_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_train = train_df['DailyGross']
X_test = test_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_test = test_df['DailyGross']
In [46]:
# Model Training
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction
predictions = model.predict(X_test)

# Evaluation
mape = mean_absolute_percentage_error(y_test, predictions)
mse = mean_squared_error(y_test,predictions)
rmse = np.sqrt(mse)

print('The MAPE of the linear regression model (trained on restricted data) on the test data is: ',round(mape,2))
print('The RMSE of the linear regression model (trained on restricted data) on the test data is: ',round(rmse,2))
The MAPE of the linear regression model (trained on restricted data) on the test data is:  1.3
The RMSE of the linear regression model (trained on restricted data) on the test data is:  17359543.89

Observe that including the COVID data does a better job at predicting on the test data. So, here, we choose to work with the COVID data based on the simple linear Regression model. The next step is to find a better model. In particular, we use SARIMAX -- Seasonal model that combined Auto Regression with a Moving Average and eXogneous features.

SARIMAX model¶

Exogeneous Features¶

The next step is to gain a deeper understanding into trends and seasonal behavior.

  1. Trend: While there is an increasing trend, COVID has dampened this trend. This becomes clear when we look at the moving average of the time series. For the last two years, the moving average lies clearly below the trend.
  2. Seasonality: The periodogram of the data shows three types of seasonality in the data: (a) weekly, (b) halfweekly, and (c) half-yearly. We will take these seasons into consideration when building the model.
  3. Holidays: As we noted earlier, holidays increase the box office. So, we will take both American as well as Chinese holidays into consideration.

Moving Average and Trends¶

Including the COVID data, means that the trend (capturing only linear behavior) does not work well. The strong decrease of the box office during COVID can not very well be explained by a linear trend. When we exclude COVID data, the moving average still declines below the linear regression trend.

In [47]:
df = pd.get_dummies(df_covid, columns=['Weekday']) # Include COVID data

# 1. Calculate Moving Average
moving_average = df['DailyGross'].rolling(window=365,center=True).mean()

# 2. Linear Trend Forecast Plot
df['DateNum'] = df['Date'].apply(lambda x: x.toordinal())  # Convert dates to ordinal numbers for linear regression
model = LinearRegression()
model.fit(df[['DateNum']], df['DailyGross'])

# Predictions for trend line
df['Trend'] = model.predict(df[['DateNum']])

# Plotting the data and trend
plt.figure(figsize=(15, 7))
plt.plot(df.index, df['DailyGross'], label='Actual Data', color='cornflowerblue')
plt.plot(df.index, df['Trend'], label='Trend', color='fuchsia', linewidth=3)
plt.plot(df.index, moving_average,color='white',linewidth=3)
plt.title('Box Office Data with Linear Trend Line')
plt.xlabel('Date')
plt.ylabel('Daily Gross')
plt.legend()
plt.show()
In [48]:
df_encoded = df_encoded.reset_index().drop(columns='index')

Seasonality¶

Examining the periodogram indicates the presence of three forms of cyclical behavior:

  1. Weekly cycles
  2. Half-weekly cycles
  3. Half-yearly cycles

While dominated by the weekly cycles, we also observe half-weekly and half-yearly cycles.

In [49]:
import pandas as pd
import numpy as np
from scipy.signal import periodogram
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Assuming 'df' is your DataFrame and it has 'DailyGross' column with time series data
# Example: df = pd.read_csv('path_to_your_data.csv')

# Calculate the periodogram
fs=1
frequencies, spectrum = periodogram(df_covid.DailyGross,fs)

# Convert frequencies to periods (days per cycle)
periods = 1 / frequencies

# Plot the periodogram
plt.figure(figsize=(14, 6))
plt.plot(periods, spectrum, color='skyblue',linewidth=5)
plt.xscale('log')
#plt.yscale('log')
plt.title('Periodogram')
plt.xlabel('Period (days per cycle)')
plt.ylabel('Variance')

# Set x-axis to show periods as daily, weekly, monthly
formatter = FuncFormatter(lambda x, _: f'{int(x)}')
plt.gca().xaxis.set_major_formatter(formatter)

# Highlight daily, weekly, and monthly frequencies
half_weekly_freq = 2/7
weekly_freq = 1/7
half_yearly_freq = 2/365
count = 0
color = ['darkviolet','fuchsia','crimson']
for freq, label in zip([half_weekly_freq, weekly_freq, half_yearly_freq], ['Halfweekly', 'Weekly','Half-Yearly']):
    plt.axvline(1/freq, color=color[count], linestyle=':', alpha=0.7, label=f'{label} (1/{1/freq:.0f})',linewidth=5)
    count+=1

plt.legend()
plt.show()
/var/folders/34/5_gd9x8s1t14rbtl0dwn2l940000gn/T/ipykernel_90777/3860585986.py:15: RuntimeWarning: divide by zero encountered in true_divide
  periods = 1 / frequencies

Holidays¶

Even though we model the domestic box office, including both US as well as Chinese holidays, increases the performance of the model. Other holidays do not increase the performance of the model on the validation data.

In [50]:
from datetime import date 
import holidays 
  
# Create list of holidays for the two largest contributors to box office
holidays_china = holidays.China(years = range(2002,2025)).items() 
holidays_us = holidays.UnitedStates(years = range(2002,2025)).items() 

# Create column in pd dateformat
df['DateTime'] =[pd.to_datetime(day).date() for day in df['Date']]

# Initialize columns for holidays
df['Holiday_US'] = 0
df['Holiday_China'] = 0


# One-hot encode holidays
for holiday in holidays_us:
    df.loc[df['DateTime'] == holiday[0], 'Holiday_US'] = 1

for holiday in holidays_china:
    df.loc[df['DateTime'] == holiday[0], 'Holiday_China'] = 1

Seasonality -- Including half-weekly and half-yearly features¶

We include half-weekly as well as helf-yearly features through sine and cosine.

In [51]:
# Half-weekly cycle: Using sine and cosine transformations with a period of 3.5 days (half a week)
df['sin_half_week'] = np.sin(2 * np.pi * df.Date.dt.dayofweek / 3.5)
df['cos_half_week'] = np.cos(2 * np.pi * df.Date.dt.dayofweek / 3.5)

# Half-yearly cycle: Using sine and cosine transformations with a period of 182.5 days (half a year)
day_of_year = df.DayOfYear
df['sin_half_year'] = np.sin(2 * np.pi * day_of_year / 365.25 * 2)
df['cos_half_year'] = np.cos(2 * np.pi * day_of_year / 365.25 * 2)

Model building¶

Finally, we build the SARIMAX model and hypertune it on the validation data.

In [53]:
exog_features = ['Weekday_Monday', 'Weekday_Tuesday', 'Weekday_Wednesday', 'Weekday_Thursday', 
                 'Weekday_Friday', 'Weekday_Saturday', 'Weekday_Sunday', 'sin_half_week', 
                 'cos_half_week', 'sin_half_year', 
                 'cos_half_year','Holiday_US','Holiday_China']

# Splitting the data into train and test sets
# The test set is the last week of the dataframe
train = df.iloc[:-7]
test = df.iloc[-7:]

# Target variable
y_train = train['DailyGross']
y_test = test['DailyGross']
exog_test = test[exog_features]

model = SARIMAX(y_train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7),  # Weekly seasonality
                exog=train[exog_features])
results = model.fit()

predictions = results.get_forecast(steps=7, exog=exog_test)
forecast_mean = predictions.predicted_mean
predictions_ci = predictions.conf_int()

# Plotting the results
plt.figure(figsize=(20, 12))
plt.plot(df.iloc[-150:].DailyGross,marker='o',markersize=10,label='Training Data',linewidth=3,color='cornflowerblue')
#plt.plot(test['DailyGross'], label='Actual Sales',linewidth=5)
plt.plot(forecast_mean,'--',label='Predicted Sales', color='fuchsia',linewidth=3,marker='^',markersize=10)
plt.fill_between(predictions_ci.index, predictions_ci.iloc[:, 0], predictions_ci.iloc[:, 1], color='k', alpha=.2)
plt.title('SARIMAX Model Predictions vs Actual')
plt.legend()
plt.show()
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           19     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.73837D+01    |proj g|=  6.00908D-02
 This problem is unconstrained.
At iterate    5    f=  1.73498D+01    |proj g|=  1.36699D-02

At iterate   10    f=  1.73429D+01    |proj g|=  2.68181D-03

At iterate   15    f=  1.73412D+01    |proj g|=  5.45679D-04

At iterate   20    f=  1.73412D+01    |proj g|=  5.88306D-04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
   19     22     24      1     0     0   1.008D-06   1.734D+01
  F =   17.341196901115687     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            
In [54]:
# Calculating MAPE and RMSE
mape = mean_absolute_percentage_error(y_test, forecast_mean)
mse = mean_squared_error(y_test,forecast_mean)
rmse = np.sqrt(mse)

print('The MAPE of the hyptertuned SARIMAX model on the test data is: ',round(mape,2))
print('The RMSE of the hyptertuned SARIMAX model on the test data is: ',round(rmse,2))
The MAPE of the hyptertuned SARIMAX model on the test data is:  0.41
The RMSE of the hyptertuned SARIMAX model on the test data is:  5657773.21

Results and Future Directions¶

The SARIMAX model does a reasonable job at predicting the patterns of box office behavior. It captures the weekly up and downs more clearly than the linear model. However, there is still a lot of room for improvement. Despite implementing other models (like LSTM), it might be of interest to consider other features. For example, it could be interesting to incorporate social media trends into the data set to get better predictions of significant jumps (like "Barbenheimer" in July 2023).